An empirical Bayes method for gene expression analysis in R
Abstract
In recent years, microarray technology has made it feasible to measure the expression of thousands of genes at once and so to identify changes between different biological states. Such experiments confront us with high dimensionality (thousands of genes) combined with small sample sizes (due to the limited availability of cases). The set of differentially expressed genes is unknown, and the number of its elements is relatively small. Because biological background information is lacking, identifying this set is a statistically and computationally demanding task.

The fundamental question we wish to address is which genes are differentially expressed. The standard statistical approach is significance testing. The null hypothesis for each gene is that the observed data share some common distributional parameter across the conditions, usually the mean expression level. Under this approach, a statistic that is a function of the data is calculated for each gene. Besides the type I error (false positive) and the type II error (false negative), there is the complication of testing multiple hypotheses simultaneously: each gene has its own type I and type II errors, so compound error measures are required. Several such measures have been suggested recently ([1]). Selecting among them is far from trivial, and calculating them is computationally expensive.

As an alternative to testing, we propose an empirical Bayes thresholding (EBT) approach for estimating possibly sparse sequences observed in white noise (modest correlation is tolerable). A sparse sequence consists of a relatively small number of informative measurements, in which the signal component dominates, and a very large number of noisy zero measurements. Gene expression analysis fits this concept. We therefore apply a new method outlined in [5]. It circumvents the complication of multiple testing and, apart from distributional assumptions, needs no user-specified parameters. This automatic and computationally efficient thresholding technique is implemented in R. The practical relevance of EBT is demonstrated on cDNA measurements. The preprocessing steps and the identification of differentially expressed genes are performed using R functions ([4]) and Bioconductor libraries ([3]). Finally, comparisons with selected testing approaches based on compound error measures available in multtest ([2]) are shown.
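The thresholding idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' R implementation and not necessarily the method of [5]. It assumes unit-variance Gaussian noise and a prior that mixes a point mass at zero with a Gaussian slab of fixed standard deviation `tau`. The Gaussian slab is chosen purely for simplicity, and fixing `tau` is a simplification that a fully automatic method would avoid. The function names `estimate_weight`, `posterior_median`, and `ebt_threshold` are hypothetical. The mixing weight is estimated by marginal maximum likelihood over a grid, and each observation is then replaced by its posterior median, which is exactly zero for small observations.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)


def estimate_weight(x, tau, grid_size=199):
    """Marginal maximum likelihood estimate of the mixing weight w (grid search).

    Assumed model: x_i ~ (1 - w) * N(0, 1) + w * N(0, 1 + tau^2).
    """
    slab = NormalDist(0.0, math.sqrt(1.0 + tau ** 2))
    f0 = [STD_NORMAL.pdf(v) for v in x]   # marginal density if the signal is zero
    f1 = [slab.pdf(v) for v in x]         # marginal density if the signal is nonzero
    best_w, best_ll = 0.0, -math.inf
    for k in range(1, grid_size + 1):
        w = k / (grid_size + 1)
        ll = sum(math.log((1.0 - w) * a + w * b) for a, b in zip(f0, f1))
        if ll > best_ll:
            best_w, best_ll = w, ll
    return best_w


def posterior_median(xi, w, tau):
    """Posterior median of the signal given one observation xi."""
    shrink = tau ** 2 / (1.0 + tau ** 2)
    m = shrink * abs(xi)                  # slab posterior mean, computed for |xi|
    s = math.sqrt(shrink)                 # slab posterior standard deviation
    f0 = STD_NORMAL.pdf(xi)
    f1 = NormalDist(0.0, math.sqrt(1.0 + tau ** 2)).pdf(xi)
    p0 = (1.0 - w) * f0 / ((1.0 - w) * f0 + w * f1)   # posterior P(signal == 0)
    # Posterior mass at or below zero: point mass plus the slab's negative tail.
    mass_le_zero = p0 + (1.0 - p0) * STD_NORMAL.cdf(-m / s)
    if mass_le_zero >= 0.5:
        return 0.0                        # observation is thresholded to zero
    q = (0.5 - p0) / (1.0 - p0)
    return math.copysign(m + s * STD_NORMAL.inv_cdf(q), xi)


def ebt_threshold(x, tau=3.0):
    """Estimate w from the data, then apply the posterior median to every value."""
    w = estimate_weight(x, tau)
    return w, [posterior_median(v, w, tau) for v in x]


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated stand-in for per-gene statistics: mostly noise, a few shifted values.
    noise = [rng.gauss(0.0, 1.0) for _ in range(475)]
    signal = [rng.choice((-5.0, 5.0)) + rng.gauss(0.0, 1.0) for _ in range(25)]
    w_hat, estimates = ebt_threshold(noise + signal)
    flagged = sum(e != 0.0 for e in estimates)
    print(f"estimated weight: {w_hat:.3f}, nonzero estimates: {flagged}")
```

Because the posterior median returns exact zeros, the nonzero estimates identify candidate genes directly, without a separate multiple-testing correction. In this sketch the effective threshold adapts to the data only through the estimated weight.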
Publication date: 2004